PERIODS AND BORDERS OF RANDOM WORDS


STEPAN HOLUB AND JEFFREY SHALLIT 


Abstract. We investigate the behavior of the periods and border lengths of
random words over a fixed alphabet. We show that the asymptotic probability
that a random word has a given maximal border length $k$ is a constant,
depending only on $k$ and the alphabet size $\ell$. We give a recurrence that allows
us to determine these constants with any required precision. This also allows
us to evaluate the expected period of a random word. For the binary case, the
expected period is asymptotically about $n - 1.641$. We also give explicit
formulas for the probability that a random word is unbordered or has maximum
border length one.


1. Introduction and notation 

A word is a finite sequence of letters chosen from a finite alphabet $\Sigma$. The
periodicity of words is a classical and well-studied topic in both discrete mathematics
and combinatorics on words, starting with the classic paper of Fine and Wilf [5]
and continuing with the works of Guibas and Odlyzko [...]. For more recent
work, see, for example, [...].

We say that a word $w$ has period $p$ if $w[i] = w[i+p]$ for all $i$ that make the
equation meaningful. (If $|w| = n$ and one indexes beginning at position 1, this
would be for $1 \le i \le n - p$.) Trivially every word of length $n$ has all periods
of length $\ge n$, so we restrict our attention to periods $\le n$. The least period is
sometimes called the period. For example, the French word entente has periods 3,
6, and 7.
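
The definition can be checked mechanically. The following Python sketch lists all periods of a word by testing the defining condition directly; the function name `periods` is ours and purely illustrative, and positions are indexed from 0 rather than 1.

```python
def periods(w: str) -> list[int]:
    """Return all periods p with 1 <= p <= len(w), in increasing order."""
    n = len(w)
    # p is a period when w[i] == w[i + p] for every i where both sides exist.
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


print(periods("entente"))  # [3, 6, 7]
```

The last element of the list is always the length of the word, which is the trivial period.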

Empirically, one quickly discovers that a randomly chosen word typically has 
a least period that is very close to its length. This readily follows from the fact 
that the number of words over a given alphabet grows exponentially as the length 
increases. It can also be seen as a particular case of the fact that most strings are 
not compressible. 

In this paper, we quantify this basic observation and show that the expected
least period of a string of length $n$ over an $\ell$-letter alphabet is $n - \alpha_\ell(n)$, where
$\alpha_\ell(n)$ is $O(1)$.

Another concept frequently studied in formal language theory is that of a border
of a word [...]. A word $x$ has border $w$ if $w$ is both a prefix and a suffix of
$x$. Normally we do not consider the trivial borders of length 0 or $n = |x|$. Thus,
for example, the English word ionization has one border: ion. Less trivially, the
word alfalfa has two borders: a and alfa. A word with no borders is unbordered.

There is an obvious connection between periods of a word and its borders: if w 
has a period $p$, then it has a border of length $|w| - p$. For example, the English
word abracadabra, of length 11, has periods 7, 10, and 11, while it has borders of 
length 1 and 4. 
Consequently, the least period of a word corresponds to the length of the longest
border (and an unbordered word corresponds to least period $n$, the length of the
word). The reader should be constantly aware of this duality, since it is often
useful and more natural to think about periods in terms of borders. This can be
seen from the announced result: it is more compact to speak directly about the
expected maximum border length, which is $\alpha_\ell(n)$.
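
The duality between periods and borders is easy to test on the example words. The sketch below is only an illustration; the helper names `periods` and `border_lengths` are ours. It computes the nontrivial border lengths of a word and checks that $p < n$ is a period exactly when $n - p$ is a border length.

```python
def periods(w: str) -> list[int]:
    """All periods p with 1 <= p <= len(w)."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def border_lengths(w: str) -> list[int]:
    """Lengths k with 0 < k < len(w) whose prefix and suffix of length k coincide."""
    n = len(w)
    return [k for k in range(1, n) if w[:k] == w[n - k:]]


for word in ("ionization", "alfalfa", "abracadabra"):
    n = len(word)
    borders = border_lengths(word)
    # Duality: a period p < n corresponds to a border of length n - p.
    assert borders == sorted(n - p for p in periods(word) if p < n)
    print(word, "borders:", borders, "periods:", periods(word))
```

The trivial period $n$ corresponds to the border of length zero, which `border_lengths` deliberately omits.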

If $P$ is a set of integers, we shall write $n - P$ for $\{n - p \mid p \in P\}$, and $P - n$ for
$\{p - n \mid p \in P\}$.

By $\mathrm{pref}_i(w)$, we mean the prefix of length $i$ of the word $w$.

2. Multiperiodic words and the average border length

We shall obtain our results by counting words with a given length $n$ and a given
finite set of periods $P \subseteq \{1, 2, \ldots, n\}$, or equivalently, with a given set of border
lengths $n - P$. For technical reasons, in order to be able to deal with unbordered
words, we shall always suppose that $n \in P$, that is, we shall say that every word
has a border of length zero.

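There are two basic types of requirements. Let

\[ \mathcal{G}_\ell(P, n) = \{ w \in \Sigma_\ell^n \mid \text{for each } p \text{ in } P,\ p \text{ is a period of } w \}, \]

where $\Sigma_\ell$ denotes an alphabet of $\ell$ letters, and let $G_\ell(P, n)$ be the cardinality
of $\mathcal{G}_\ell(P, n)$. Similarly, let

\[ \mathcal{F}_\ell(P, n) = \{ w \in \mathcal{G}_\ell(P, n) \mid \min P \text{ is the least period of } w \}, \]

and let $F_\ell(P, n)$ be the cardinality of $\mathcal{F}_\ell(P, n)$.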

Words with many periods have been amply studied. In particular, there is a fast 
algorithm constructing a word of length n with periods P and maximal possible 
number of letters. Such a word, called an FW-word in the literature, is unique up 
to renaming of the letters. Let c(P, n) denote the cardinality of the alphabet of the 
FW-word of length n and periods P. 

Example 1. Let $P = \{p, q\}$ and $d = \gcd(p, q)$. The well-known periodicity lemma
(often called the Fine and Wilf theorem, which is the origin of the term FW-word)
states that if a word of length at least $p + q - d$ has periods $p$ and $q$, then it also has
period $d$. Moreover, the bound $p + q - d$ is sharp; for all $p, q > 1$ such that neither
divides the other, there are words of length $p + q - d - 1$ with periods $p$ and $q$ but
not period $d$. This can be stated, using the just-introduced terminology, by the two
assertions $c(\{p, q\}, p + q - d) = d$ and $c(\{p, q\}, p + q - d - 1) > d$.

The number $c(P, n)$ can be computed and the corresponding FW-word
constructed using the algorithm of Tijdeman and Zamboni [...] (see [5] for an
alternative presentation). The computation is summarized by the following formula:

\[
c(P, n) =
\begin{cases}
1, & \text{if } m = 1; \\
n, & \text{if } m \ge n; \\
c(Q, n - m), & \text{if } 2m \le n; \\
c(Q, n - m) + 2m - n, & \text{if } m < n < 2m;
\end{cases}
\]

where $m = \min P$ and $Q = ((P - m) \setminus \{0\}) \cup \{m\}$.
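
The recursion for $c(P, n)$ translates almost line by line into code. The following Python sketch is a direct transcription under the assumption that the four cases are tried in the order listed; the function name `fw_alphabet_size` is ours. The final lines check the two assertions of Example 1 on a few pairs $p, q$.

```python
def fw_alphabet_size(P: frozenset, n: int) -> int:
    """c(P, n): number of distinct letters in the FW-word of length n with periods P."""
    m = min(P)
    if m == 1:
        return 1
    if m >= n:
        return n
    # Q = ((P - m) \ {0}) united with {m}
    Q = frozenset(p - m for p in P if p != m) | {m}
    if 2 * m <= n:
        return fw_alphabet_size(Q, n - m)
    return fw_alphabet_size(Q, n - m) + 2 * m - n


# Example 1 for a few pairs (p, q, d) with d = gcd(p, q), neither dividing the other.
for p, q, d in ((2, 3, 1), (3, 5, 1), (4, 6, 2), (6, 10, 2)):
    P = frozenset({p, q})
    assert fw_alphabet_size(P, p + q - d) == d
    assert fw_alphabet_size(P, p + q - d - 1) > d
```

Each recursive call shortens the word by $m \ge 2$ letters, so the recursion depth is at most $n/2$.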
Since each word having the periods in $P$ (and possibly others) results from a
coding (a letter-to-letter mapping) of the corresponding FW-word, we obtain

\[ G_\ell(P, n) = \ell^{c(P, n)}, \]

which is the starting point of our computation.

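Note that $\mathcal{F}_\ell(\{p\}, n)$ is the set of words from $\Sigma_\ell^n$ having least period $p$.
Equivalently, $\mathcal{F}_\ell(\{n - r\}, n)$ is the set of words with the longest border of length $r$. For
$0 \le r < n$, let

\[ \lambda_\ell(r, n) = \frac{F_\ell(\{n - r\}, n)}{\ell^n} \]

denote the relative number of such words. Our goal is to compute

\[ \alpha_\ell(n) = \sum_{r=0}^{n-1} r \, \lambda_\ell(r, n), \]

which is the expected maximum border length for words in $\Sigma_\ell^n$. We first show that
this quantity converges as $n$ approaches infinity.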
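
For small $n$ these quantities can be obtained by exhaustive enumeration. The Python sketch below is a brute-force illustration only, not the method developed in the later sections; the helper names are ours. It tabulates the distribution of the longest border length over all words of length $n$ and the resulting expected value, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def longest_border(w) -> int:
    """Length of the longest border of w (0 if w is unbordered)."""
    n = len(w)
    for k in range(n - 1, 0, -1):
        if w[:k] == w[n - k:]:
            return k
    return 0


def border_distribution(ell: int, n: int) -> list[Fraction]:
    """Entry r is the fraction of words of length n whose longest border has length r."""
    counts = [0] * n
    for w in product(range(ell), repeat=n):
        counts[longest_border(w)] += 1
    return [Fraction(c, ell ** n) for c in counts]


def expected_border(ell: int, n: int) -> Fraction:
    """Expected longest border length over all words of length n."""
    return sum(r * lam for r, lam in enumerate(border_distribution(ell, n)))


for n in (4, 8, 12):
    print(n, float(expected_border(2, n)))
```

The running time grows as $\ell^n$, so this is only practical for short words.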

Lemma 2. For each $\ell \ge 2$ and each $0 \le r < n$,

\[ |\lambda_\ell(r, n+1) - \lambda_\ell(r, n)| \le \ell^{-\lfloor n/2 \rfloor}. \]


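Proof. Case 1: $r > \lfloor n/2 \rfloor$. Then

\[ |\lambda_\ell(r, n+1) - \lambda_\ell(r, n)| = \left| \frac{F_\ell(\{n+1-r\}, n+1)}{\ell^{n+1}} - \frac{F_\ell(\{n-r\}, n)}{\ell^{n}} \right|. \]

Recall that $F_\ell(\{n+1-r\}, n+1)$ (resp., $F_\ell(\{n-r\}, n)$) counts the words with
longest border length $r$ from $\Sigma_\ell^{n+1}$ (resp., $\Sigma_\ell^n$). First, note that $F_\ell(\{p\}, n) \le \ell^p$ for
any $p$ and $n$. This implies

\[ |\lambda_\ell(r, n+1) - \lambda_\ell(r, n)| \le \max\left\{ \frac{\ell^{n+1-r}}{\ell^{n+1}}, \frac{\ell^{n-r}}{\ell^{n}} \right\} = \ell^{-r} < \ell^{-\lfloor n/2 \rfloor}, \]

and we are done.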

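Case 2: $r \le \lfloor n/2 \rfloor$. There is a useful correspondence between $\Sigma_\ell^n$ and $\Sigma_\ell^{n+1}$,
given by the insertion of a letter in the middle of the shorter word. The basic
observation, already used in [...], is that this insertion does not influence borders
of length at most $\lfloor n/2 \rfloor$. Define

\[ \mathcal{F} = \mathcal{F}_\ell(\{n+1-r\}, n+1), \]
\[ \mathcal{B} = \{ w_1 a w_2 \mid a \in \Sigma_\ell,\ |w_1| = \lfloor n/2 \rfloor,\ |w_2| = \lceil n/2 \rceil,\ w_1 w_2 \in \mathcal{F}_\ell(\{n-r\}, n) \}. \]

Then $|\mathcal{B}| = \ell \cdot F_\ell(\{n-r\}, n)$. Let $w \in \mathcal{F} \setminus \mathcal{B}$ and write $w = w_1 a w_2$ with $a \in \Sigma_\ell$,
$|w_1| = \lfloor n/2 \rfloor$, and $|w_2| = \lceil n/2 \rceil$. The words $w$ and $w_1 w_2$ have the same borders up
to length $\lfloor n/2 \rfloor$. Since $w_1 w_2 \notin \mathcal{F}_\ell(\{n-r\}, n)$, we deduce that $w_1 w_2$ has a border
of length at least $\lfloor n/2 \rfloor + 1$, that is, a period at most $\lceil n/2 \rceil - 1$. This implies

\[ |\mathcal{F} \setminus \mathcal{B}| \le \ell \cdot \sum_{p=1}^{\lceil n/2 \rceil - 1} \ell^p < \ell^{\lceil n/2 \rceil + 1}. \tag{1} \]

Similarly, a word $w \in \mathcal{B} \setminus \mathcal{F}$ has period at most $\lceil n/2 \rceil$, and so

\[ |\mathcal{B} \setminus \mathcal{F}| \le \sum_{p=1}^{\lceil n/2 \rceil} \ell^p < \ell^{\lceil n/2 \rceil + 1}. \tag{2} \]

We thus obtain

\[ |\lambda_\ell(r, n+1) - \lambda_\ell(r, n)| = \frac{\bigl|\, |\mathcal{F}| - |\mathcal{B}| \,\bigr|}{\ell^{n+1}} \le \frac{\max\{ |\mathcal{F} \setminus \mathcal{B}|, |\mathcal{B} \setminus \mathcal{F}| \}}{\ell^{n+1}} < \frac{\ell^{\lceil n/2 \rceil + 1}}{\ell^{n+1}} = \ell^{-\lfloor n/2 \rfloor}. \]

□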


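Theorem 3. For each $\ell \ge 2$ and $r \ge 0$, the limits

\[ \alpha_\ell := \lim_{n \to \infty} \alpha_\ell(n) \quad \text{and} \quad \lambda_\ell(r) := \lim_{n \to \infty} \lambda_\ell(r, n) \]

exist. Furthermore, the convergence is exponential.

Proof. Follows directly from the definition of $\alpha_\ell(n)$ and Lemma 2. □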


3. Recurrences 


From the estimates of the previous section, we know that $\alpha_\ell(n)$ and $\lambda_\ell(r, n)$ both
converge quickly to $\alpha_\ell$ and $\lambda_\ell(r)$, respectively. Thus, they can be estimated to a
few digits by explicit enumeration.

In order to evaluate $\alpha_\ell(n)$ to dozens of decimal places, however, we need a more
efficient way to calculate $F_\ell(\{p\}, n)$. This can be done using the recurrence formulas
that we derive below. They reformulate the formulas given by Harborth [7] and
generalize them to sets of periods.

We first prove the following auxiliary claim. 

Lemma 4. Let a word $w$ have a period $p < |w|$ and let $u$ be the prefix of $w$ of
length $|w| - p$. Then $w$ has a period $q > p$ if and only if $u$ has a period $q - p$.

Proof. Note that $u$ is a border of $w$. The following conditions are easily seen to be
equivalent:

• $w$ has a period $q$,

• $w$ has a border of length $|w| - q$,

• $u$ has a border of length $|w| - q$,

• $u$ has a period $|u| - (|w| - q)$.

Since $|u| - (|w| - q) = (|w| - p) - (|w| - q) = q - p$, the proof is completed. □
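
Theorem 5. Let $P$ be a set of periods with $m = \min P$ and $\max P \le n$. Then

\[ F_\ell(P, n) = G_\ell(P, n) - \sum_{p = \lceil m/2 \rceil}^{m-1} H_\ell(P, p, n), \tag{3} \]

where

\[
H_\ell(P, p, n) :=
\begin{cases}
F_\ell((P - p) \cup \{p\},\ n - p), & \text{if } p \le \lfloor n/2 \rfloor, \\
\ell^{2p - n} \cdot F_\ell(P - p,\ n - p), & \text{if } p \ge \lceil n/2 \rceil.
\end{cases}
\tag{4}
\]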
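Proof. From $G_\ell(P, n)$ we have to subtract the number of words from $\Sigma_\ell^n$ that have
periods $P$ but also a period smaller than $m$. We define, for each $1 \le p < m$, the set

\[ \mathcal{H}_\ell(P, p, n) = \{ w \in \Sigma_\ell^n \mid w \text{ has periods } P \cup \{p\}, \text{ and no period } p' \text{ with } p < p' < m \}. \]

If $p < \lceil m/2 \rceil$ then $\mathcal{H}_\ell(P, p, n)$ is empty, since a word $w \in \mathcal{H}_\ell(P, p, n)$ also has a
period $2p$, and $p < 2p < m$ contradicts the definition of $\mathcal{H}_\ell(P, p, n)$. Moreover, the sets
$\mathcal{H}_\ell(P, p, n)$ are pairwise disjoint, and

\[ \mathcal{G}_\ell(P, n) \setminus \mathcal{F}_\ell(P, n) = \bigcup_{p = \lceil m/2 \rceil}^{m-1} \mathcal{H}_\ell(P, p, n). \]

It remains to show that $H_\ell(P, p, n)$ is the cardinality of $\mathcal{H}_\ell(P, p, n)$ for each
$\lceil m/2 \rceil \le p \le m - 1$.

Let $p \le \lfloor n/2 \rfloor$. We claim that $w \mapsto \mathrm{pref}_{n-p}(w)$ is a one-to-one mapping of
$\mathcal{H}_\ell(P, p, n)$ to $\mathcal{F}_\ell((P - p) \cup \{p\}, n - p)$. Let $w \in \mathcal{H}_\ell(P, p, n)$. By Lemma 4, the
word $\mathrm{pref}_{n-p}(w)$ has periods $P - p$ and no period $p' - p$ with $p < p' < m$, that is,
no period less than $m - p$. Since $m - p = \min((P - p) \cup \{p\})$ and since $\mathrm{pref}_{n-p}(w)$
also has a period $p$, we have $\mathrm{pref}_{n-p}(w) \in \mathcal{F}_\ell((P - p) \cup \{p\}, n - p)$. Similarly, one
can verify that if $v \in \mathcal{F}_\ell((P - p) \cup \{p\}, n - p)$, then $w_v := (\mathrm{pref}_p(v))^{n/p} \in \mathcal{H}_\ell(P, p, n)$
and $\mathrm{pref}_{n-p}(w_v) = v$. Here the fractional power $(\mathrm{pref}_p(v))^{n/p}$ denotes the prefix of
length $n$ of the infinite repetition of $\mathrm{pref}_p(v)$.

Let $p \ge \lceil n/2 \rceil$. Again, using Lemma 4, it is straightforward to verify that

\[ \mathcal{H}_\ell(P, p, n) = \{ vuv \mid v \in \mathcal{F}_\ell(P - p, n - p),\ u \in \Sigma_\ell^{2p - n} \}. \]

□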
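
The recurrence of Theorem 5 can be turned into a memoized function. The sketch below is one possible transcription, not the authors' implementation: it assumes $G_\ell(P, n) = \ell^{c(P, n)}$ with $c$ computed by the recursion of Section 2, and all names (`c`, `F`, `F_brute`, `check`) are ours. A brute-force count over all words is included so that the recurrence can be compared with the definition on small inputs.

```python
from functools import lru_cache
from itertools import combinations, product


def c(P: frozenset, n: int) -> int:
    """Alphabet size of the FW-word of length n with periods P (Section 2 recursion)."""
    m = min(P)
    if m == 1:
        return 1
    if m >= n:
        return n
    Q = frozenset(p - m for p in P if p != m) | {m}
    return c(Q, n - m) if 2 * m <= n else c(Q, n - m) + 2 * m - n


@lru_cache(maxsize=None)
def F(P: frozenset, n: int, ell: int) -> int:
    """Words of length n over ell letters with all periods in P and least period min(P)."""
    m = min(P)
    total = ell ** c(P, n)                      # G_ell(P, n)
    for p in range((m + 1) // 2, m):            # p = ceil(m/2), ..., m - 1
        shifted = frozenset(q - p for q in P)   # the set P - p
        if p <= n // 2:
            total -= F(shifted | {p}, n - p, ell)
        else:
            total -= ell ** (2 * p - n) * F(shifted, n - p, ell)
    return total


def has_period(w, p: int) -> bool:
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def F_brute(P: frozenset, n: int, ell: int) -> int:
    """The same count, straight from the definition (exponential time)."""
    m = min(P)
    count = 0
    for w in product(range(ell), repeat=n):
        least = next(p for p in range(1, n + 1) if has_period(w, p))
        if least == m and all(has_period(w, p) for p in P):
            count += 1
    return count


def check(ell: int, n_max: int) -> bool:
    """Compare recurrence and brute force for all period sets of size at most 3."""
    return all(F(frozenset(P), n, ell) == F_brute(frozenset(P), n, ell)
               for n in range(1, n_max + 1)
               for k in (1, 2, 3)
               for P in combinations(range(1, n + 1), k))


print(check(2, 6))
print([F(frozenset({n}), n, 2) for n in range(1, 9)])  # unbordered binary words
```

The period sets are passed as `frozenset` values so that `lru_cache` can hash them.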

If $\min P$ is small, then we can formulate a more explicit formula that uses the
Möbius $\mu$-function.

Lemma 6. Let $P$ be a set of periods with $m = \min P \le \lfloor n/2 \rfloor + 1$. Then

\[ F_\ell(P, n) = \sum_{d \mid m} \mu\!\left(\frac{m}{d}\right) G_\ell(P \cup \{d\}, n). \tag{5} \]

Proof. Let $w$ be a word of length $n$ with a period $m$ and let $p$ be the least period of
$w$. Then, by the periodicity lemma, we have that $p$ divides $m$, since $p < m$ implies
$p + m - 1 \le n$. Therefore, for each divisor $p$ of $m$,

\[ G_\ell(P \cup \{p\}, n) = \sum_{d \mid p} F_\ell(P \cup \{d\}, n), \]

and the claim follows from Möbius inversion. □
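
Formula (5) can be checked numerically on small cases. The Python sketch below is an illustration under the assumption that $G_\ell(P, n) = \ell^{c(P, n)}$, with $c$ computed by the recursion of Section 2; the names `mobius`, `c`, `F_mobius`, `F_brute` and `check_lemma6` are ours.

```python
from itertools import product


def mobius(k: int) -> int:
    """Möbius function mu(k), computed by trial division."""
    result, d = 1, 2
    while d * d <= k:
        if k % d == 0:
            k //= d
            if k % d == 0:
                return 0            # k has a squared prime factor
            result = -result
        d += 1
    return -result if k > 1 else result


def c(P: frozenset, n: int) -> int:
    """Alphabet size of the FW-word of length n with periods P."""
    m = min(P)
    if m == 1:
        return 1
    if m >= n:
        return n
    Q = frozenset(p - m for p in P if p != m) | {m}
    return c(Q, n - m) if 2 * m <= n else c(Q, n - m) + 2 * m - n


def F_mobius(P: frozenset, n: int, ell: int) -> int:
    """Right-hand side of (5); only valid when min P <= floor(n/2) + 1."""
    m = min(P)
    assert m <= n // 2 + 1
    return sum(mobius(m // d) * ell ** c(P | {d}, n)
               for d in range(1, m + 1) if m % d == 0)


def has_period(w, p: int) -> bool:
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def F_brute(P: frozenset, n: int, ell: int) -> int:
    """Words with all periods in P and no period smaller than min P."""
    m = min(P)
    return sum(1 for w in product(range(ell), repeat=n)
               if all(has_period(w, p) for p in P)
               and not any(has_period(w, p) for p in range(1, m)))


def check_lemma6(ell: int, n_max: int) -> bool:
    """Compare (5) with brute force for P = {m} and P = {m, q}, q > m."""
    for n in range(2, n_max + 1):
        for m in range(1, n // 2 + 2):
            for extra in [()] + [(q,) for q in range(m + 1, n + 1)]:
                P = frozenset((m,) + extra)
                if F_mobius(P, n, ell) != F_brute(P, n, ell):
                    return False
    return True


print(check_lemma6(2, 8))
```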

4. Explicit formulas 

In this section we derive explicit formulas for $\lambda_\ell(0)$ and $\lambda_\ell(1)$, which are the
asymptotic probabilities that a random word is unbordered, or has longest border
of length one, respectively. These are two cases in which Theorem 5 yields a relatively
simple expression, since $\lceil m/2 \rceil \ge \lfloor n/2 \rfloor$.




4.1. Unbordered words. The number of unbordered words satisfies a well-known
recurrence formula (see, e.g., [7, p. 143, Eq. (34)] for the binary case and [...] for
the general case). The formula can be verified using Theorem 5, but we shall give
an elementary proof. In this section, let $u_n$ denote $F_\ell(\{n\}, n)$, and let $t(n)$ denote
$\lambda_\ell(0, n)$.

Theorem 7.

\[
u_n =
\begin{cases}
\ell, & \text{if } n = 1; \\
\ell(\ell - 1), & \text{if } n = 2; \\
\ell \cdot u_{n-1}, & \text{if } n \ge 3 \text{ is odd}; \\
\ell \cdot u_{n-1} - u_{n/2}, & \text{if } n \ge 4 \text{ is even}.
\end{cases}
\]

Proof. For $n = 1, 2$, the verification is straightforward. Let $x$ and $y$ be nonempty
words with $|x| = |y|$ and consider words $xy$, $xay$ and $xaby$ where $a$ and $b$ are letters.

Since the shortest border of $xay$ has length at most $|x|$, the word $xy$ is unbordered
if and only if $xay$ is. This proves $u_n = \ell \cdot u_{n-1}$ if $n$ is odd.

On the other hand, $xaby$ can have the shortest border of length $|x| + 1$. Therefore,
$xaby$ is unbordered if and only if (i) $xy$ is unbordered and (ii) $xa \ne by$. Since the
shortest border is itself unbordered, we obtain $u_n = \ell^2 \cdot u_{n-2} - u_{n/2} = \ell \cdot u_{n-1} - u_{n/2}$
if $n$ is even. □
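
The recurrence of Theorem 7 takes only a few lines of code. The sketch below uses the initial values $u_1 = \ell$ and $u_2 = \ell(\ell - 1)$, which follow directly from the definition, and compares the result with a brute-force count; the function names are ours.

```python
from itertools import product


def unbordered_counts(ell: int, n_max: int) -> list[int]:
    """[u_1, ..., u_{n_max}] from the recurrence of Theorem 7."""
    u = [0] * (n_max + 1)          # u[0] is unused
    for n in range(1, n_max + 1):
        if n == 1:
            u[n] = ell
        elif n == 2:
            u[n] = ell * (ell - 1)
        elif n % 2 == 1:
            u[n] = ell * u[n - 1]
        else:
            u[n] = ell * u[n - 1] - u[n // 2]
    return u[1:]


def is_unbordered(w) -> bool:
    n = len(w)
    return all(w[:k] != w[n - k:] for k in range(1, n))


def unbordered_brute(ell: int, n: int) -> int:
    return sum(is_unbordered(w) for w in product(range(ell), repeat=n))


print(unbordered_counts(2, 8))  # [2, 2, 4, 6, 12, 20, 40, 74]
assert unbordered_counts(3, 6) == [unbordered_brute(3, n) for n in range(1, 7)]
```

Dividing $u_n$ by $\ell^n$ gives $t(n)$, the fraction of unbordered words of length $n$.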

Theorem 7 directly yields, for each $n \ge 1$,

\[ t(2n+1) = t(2n) = t(2n-1) - t(n)\,\ell^{-n}. \]

Therefore

\[ t(2n) = t(1) + \sum_{i=2}^{2n} \bigl( t(i) - t(i-1) \bigr) = 1 - \sum_{j=1}^{n} t(j)\,\ell^{-j}. \]

Defining the generating function $L_0(x) = \sum_{n \ge 1} t(n) x^n$, we get

\[ \lim_{n \to \infty} \lambda_\ell(0, n) = 1 - L_0\!\left(\frac{1}{\ell}\right). \tag{6} \]

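The next step is to obtain a functional equation for $L_0(x)$:

\begin{align*}
L_0(x)(1 - x) &= t(1)x + \sum_{k \ge 2} \bigl( t(k) - t(k-1) \bigr) x^k \\
&= t(1)x + \sum_{j \ge 1} \bigl( t(2j) - t(2j-1) \bigr) x^{2j} \\
&= t(1)x - \sum_{j \ge 1} t(j)\,\ell^{-j} x^{2j} = x - L_0\!\left(\frac{x^2}{\ell}\right).
\end{align*}

Therefore

\[ L_0(x) = \frac{x}{1 - x} - \frac{L_0(x^2/\ell)}{1 - x}. \]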
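Successively substituting $x = 1/\ell$, $x = 1/\ell^3$, $x = 1/\ell^7$, ..., we get

\begin{align*}
L_0\!\left(\frac{1}{\ell}\right) &= \frac{1}{\ell - 1} - \frac{\ell}{\ell - 1}\, L_0\!\left(\frac{1}{\ell^3}\right), \\
L_0\!\left(\frac{1}{\ell^3}\right) &= \frac{1}{\ell^3 - 1} - \frac{\ell^3}{\ell^3 - 1}\, L_0\!\left(\frac{1}{\ell^7}\right), \\
&\;\;\vdots \\
L_0\!\left(\frac{1}{\ell^{2^i - 1}}\right) &= \frac{1}{\ell^{2^i - 1} - 1} - \frac{\ell^{2^i - 1}}{\ell^{2^i - 1} - 1}\, L_0\!\left(\frac{1}{\ell^{2^{i+1} - 1}}\right).
\end{align*}

Since it is easy to see that

\[ \lim_{n \to \infty} \left( \prod_{i=1}^{n-1} \frac{\ell^{2^i - 1}}{\ell^{2^i - 1} - 1} \right) L_0\!\left(\frac{1}{\ell^{2^n - 1}}\right) = 0, \]

we obtain

\[ \lambda_\ell(0) = 1 - \sum_{n \ge 1} (-1)^{n-1}\, \frac{1}{\ell^{2^n - 1} - 1} \prod_{i=1}^{n-1} \frac{\ell^{2^i - 1}}{\ell^{2^i - 1} - 1}. \]
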
A similar analysis was given previously by [T], although our analysis is slightly 
cleaner. 
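
The repeated substitution can be carried out numerically. The sketch below is an illustration with names of our own choosing: it unrolls the functional equation $L_0(x) = (x - L_0(x^2/\ell))/(1 - x)$ to a fixed depth, treats $L_0$ of the final, extremely small argument as 0 (an assumption justified by the vanishing limit in the derivation), and returns $1 - L_0(1/\ell)$ as in (6). Exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction


def L0(x: Fraction, ell: int, depth: int) -> Fraction:
    """Approximate L_0(x) by unrolling L_0(x) = (x - L_0(x^2/ell)) / (1 - x)."""
    if depth == 0:
        return Fraction(0)     # truncation: the remaining argument is tiny
    return (x - L0(x * x / ell, ell, depth - 1)) / (1 - x)


def lambda0(ell: int, depth: int = 8) -> Fraction:
    """Approximate limiting probability that a long random word is unbordered."""
    return 1 - L0(Fraction(1, ell), ell, depth)


for ell in (2, 3, 4, 5, 10):
    print(ell, float(lambda0(ell)))
```

Because the arguments shrink doubly exponentially ($1/\ell$, $1/\ell^3$, $1/\ell^7$, ...), a depth of 8 is already far beyond double precision.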


4.2. Words with longest border of length 1. There is also a relatively simple
recurrence for $F_\ell(\{n - 1\}, n)$, that is, for words with the longest border of length 1.
The particular case $\ell = 2$ was previously given by Harborth [7, p. 143, Eq. (36)].
In this section, we let $v_n$ denote $F_\ell(\{n - 1\}, n)$, and let $s(n)$ denote $\lambda_\ell(1, n)$.

Theorem 8.

\[
v_n =
\begin{cases}
0, & \text{if } n = 1; \\
\ell, & \text{if } n = 2; \\
\ell \cdot v_{n-1} - v_{(n+1)/2}, & \text{if } n \ge 3 \text{ is odd}; \\
\ell \cdot v_{n-1} + (\ell - 1) \cdot v_{n/2}, & \text{if } n \ge 4 \text{ is even}.
\end{cases}
\]

Proof. Verify that $v_1 = 0$ and $v_2 = \ell$, and let $x$ and $y$ be nonempty words with
$|x| = |y|$. Consider words $cxyc$, $cxayc$ and $cxabyc$ where $a$, $b$, $c$ are (not necessarily
distinct) letters.

The letter $c$ is the longest border of the word $cxayc$ if and only if (i) $c$ is the
longest border of $cxyc$ and (ii) $cxa \ne ayc$. Moreover, (i') $c$ is the longest border of
$cxyc$, and (ii') $cxa = ayc$ ($= cxc$) if and only if $c$ is the longest border of $cxc$. This
implies $v_n = \ell \cdot v_{n-1} - v_{(n+1)/2}$ for $n \ge 3$ odd.

Similarly, $c$ is the longest border of $cxabyc$ if and only if (i) $c$ is the longest border
of $cxyc$ and (ii) $cxa \ne byc$. As above, we have to subtract the number of words $cxc$
with the longest border $c$. It follows that $v_n = \ell^2 \cdot v_{n-2} - v_{n/2} = \ell \cdot v_{n-1} + (\ell - 1) \cdot v_{n/2}$
for $n \ge 4$ even. □
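
As with unbordered words, the recurrence of Theorem 8 is easy to run and to compare with exhaustive enumeration on short words. The following Python sketch is illustrative only; the function names are ours.

```python
from itertools import product


def border_one_counts(ell: int, n_max: int) -> list[int]:
    """[v_1, ..., v_{n_max}] from the recurrence of Theorem 8."""
    v = [0] * (n_max + 1)          # v[0] is unused
    for n in range(1, n_max + 1):
        if n == 1:
            v[n] = 0
        elif n == 2:
            v[n] = ell
        elif n % 2 == 1:
            v[n] = ell * v[n - 1] - v[(n + 1) // 2]
        else:
            v[n] = ell * v[n - 1] + (ell - 1) * v[n // 2]
    return v[1:]


def longest_border(w) -> int:
    n = len(w)
    for k in range(n - 1, 0, -1):
        if w[:k] == w[n - k:]:
            return k
    return 0


def border_one_brute(ell: int, n: int) -> int:
    """Number of words of length n whose longest border has length exactly 1."""
    return sum(longest_border(w) == 1 for w in product(range(ell), repeat=n))


print(border_one_counts(2, 8))  # [0, 2, 2, 6, 10, 22, 38, 82]
assert border_one_counts(3, 6) == [border_one_brute(3, n) for n in range(1, 7)]
```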


From Theorem 8 we deduce

\begin{align*}
s(2n) - s(2n-2) &= -s(n)\,\ell^{-n}, & n \ge 2, \tag{7} \\
s(2n) - s(2n-1) &= (\ell - 1)\,s(n)\,\ell^{-n}, & n \ge 2, \tag{8} \\
s(2n+1) - s(2n) &= -s(n+1)\,\ell^{-n}, & n \ge 1. \tag{9}
\end{align*}

Using (7), and recalling that $s(1) = 0$, we obtain

\[ s(2n) = s(2) + \sum_{j=2}^{n} \bigl( s(2j) - s(2j-2) \bigr) = \frac{1}{\ell} - \sum_{j=1}^{n} s(j)\,\ell^{-j}. \]

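Defining the generating function $L_1(x) = \sum_{k \ge 1} s(k) x^k$, we then get

\[ \lim_{n \to \infty} \lambda_\ell(1, n) = \frac{1}{\ell} - L_1\!\left(\frac{1}{\ell}\right). \]

A functional equation for $L_1$ is obtained as follows:

\begin{align*}
L_1(x)(1 - x) &= s(1)x + \sum_{k \ge 2} \bigl( s(k) - s(k-1) \bigr) x^k \\
&= \frac{1}{\ell} x^2 + \sum_{i \ge 1} \bigl( s(2i+1) - s(2i) \bigr) x^{2i+1} + \sum_{i \ge 2} \bigl( s(2i) - s(2i-1) \bigr) x^{2i} \\
&= \frac{1}{\ell} x^2 - \sum_{i \ge 1} s(i+1)\,\ell^{-i} x^{2i+1} + \sum_{i \ge 2} (\ell - 1)\,s(i)\,\ell^{-i} x^{2i} \\
&= \frac{1}{\ell} x^2 - \frac{\ell}{x}\, L_1\!\left(\frac{x^2}{\ell}\right) + (\ell - 1)\, L_1\!\left(\frac{x^2}{\ell}\right).
\end{align*}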


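We have

\[ L_1(x) = \frac{x^2}{\ell(1 - x)} + \frac{\ell - 1 - \ell/x}{1 - x}\, L_1\!\left(\frac{x^2}{\ell}\right) \]

and

\[ L_1\!\left(\frac{1}{\ell^{2^i - 1}}\right) = \frac{1}{\ell^{2^i}\,(\ell^{2^i - 1} - 1)} - \frac{\ell^{2^i - 1}\,(\ell^{2^i} - \ell + 1)}{\ell^{2^i - 1} - 1}\, L_1\!\left(\frac{1}{\ell^{2^{i+1} - 1}}\right). \]
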

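From here, we deduce

\[ \lambda_\ell(1) = \frac{1}{\ell} - L_1\!\left(\frac{1}{\ell}\right) = \frac{1}{\ell} - \sum_{n \ge 1} (-1)^{n-1}\, \frac{1}{\ell^{2^n}\,(\ell^{2^n - 1} - 1)} \prod_{i=1}^{n-1} \frac{\ell^{2^i - 1}\,(\ell^{2^i} - \ell + 1)}{\ell^{2^i - 1} - 1}. \]
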


We do not know how to obtain similar expressions for other border lengths. 
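
The value of $\lambda_\ell(1)$ can be approximated in the same way as $\lambda_\ell(0)$. The sketch below, with names of our own choosing, unrolls the functional equation $L_1(x)(1 - x) = x^2/\ell + (\ell - 1 - \ell/x)\,L_1(x^2/\ell)$ to a fixed depth, treating $L_1$ of the final tiny argument as 0 (an assumption), and then uses $\lambda_\ell(1) = 1/\ell - L_1(1/\ell)$, which follows by letting $n \to \infty$ in the expression for $s(2n)$.

```python
from fractions import Fraction


def L1(x: Fraction, ell: int, depth: int) -> Fraction:
    """Approximate L_1(x) by unrolling its functional equation `depth` times."""
    if depth == 0:
        return Fraction(0)     # truncation: the remaining argument is tiny
    inner = L1(x * x / ell, ell, depth - 1)
    return (x * x / ell + (ell - 1 - ell / x) * inner) / (1 - x)


def lambda1(ell: int, depth: int = 8) -> Fraction:
    """Approximate limiting probability that the longest border has length 1."""
    return Fraction(1, ell) - L1(Fraction(1, ell), ell, depth)


for ell in (2, 3, 4, 5, 10):
    print(ell, float(lambda1(ell)))
```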


5. Particular values 

Theorem 5 as well as explicit formulas from the previous section allow fast
computer evaluation of $\alpha_\ell(n)$ and $\lambda_\ell(r, n)$ for large $n$, and therefore also evaluation
of $\lambda_\ell(r)$ and $\alpha_\ell$ with high precision. We list some rounded values in the following
tables.

    $\ell$    $\alpha_\ell$
    2         1.64116491178296695613
    3         0.68587617299708343978
    4         0.42195659003603599699
    5         0.30201601806282253073
    10        0.12233344445364555354
    50        0.02081648979722449000



    $r$       $\lambda_2(r)$
    0         0.26778684021788911238
    1         0.30042007151830329926
    2         0.19891874779036456415
    3         0.11216079483159432642
    5         0.03044609816129782975
    10        0.00097577734413168807



And some values of $\lambda_\ell(r)$ rounded to five decimal digits:

    $\lambda_\ell(r)$    $\ell = 3$    $\ell = 4$    $\ell = 5$    $\ell = 10$
    $r = 0$              0.55698       0.68775       0.76006       0.89000
    $r = 1$              0.28270       0.23024       0.19034       0.09890
    $r = 2$              0.10547       0.06126       0.03961       0.00999
    $r = 3$              0.03641       0.01555       0.00798       0.00100



For example, we see that a long binary word chosen randomly has about 27% 
chance to be unbordered. A bit more probable, at 30%, is that such a word will 
have its longest border of length one. Over a five-letter alphabet, more than three 
words out of four are unbordered, on average. 

Figure 1 shows the distribution of lengths of the shortest period for binary words
of length n = 18. 



Figure 1. Distribution of lengths of shortest period for binary 
words of length 18 
Our original motivation was a question about the average period of a binary
word. The answer is that the longest border of a binary word has asymptotically
constant expected length, namely

\[ \alpha_2 = 1.64116491178296695612774416940082554065953687825771543\ldots \]

6. Final remarks 

Recently there has been some interest in computing the expected value of the 
largest unbordered factor of a word [9]. This is a related, but seemingly much 
harder, problem. 